MapReduce for accurate error correction of next-generation sequencing data

نویسندگان

  • Liang Zhao
  • Qingfeng Chen
  • Wencui Li
  • Peng Jiang
  • Limsoon Wong
  • Jinyan Li
چکیده

Motivation Next-generation sequencing platforms have produced huge amounts of sequence data. This is revolutionizing every aspect of genetic and genomic research. However, these sequence datasets contain quite a number of machine-induced errors-e.g. errors due to substitution can be as high as 2.5%. Existing error-correction methods are still far from perfect. In fact, more errors are sometimes introduced than correct corrections, especially by the prevalent k-mer based methods. The existing methods have also made limited exploitation of on-demand cloud computing. Results We introduce an error-correction method named MEC, which uses a two-layered MapReduce technique to achieve high correction performance. In the first layer, all the input sequences are mapped to groups to identify candidate erroneous bases in parallel. In the second layer, the erroneous bases at the same position are linked together from all the groups for making statistically reliable corrections. Experiments on real and simulated datasets show that our method outperforms existing methods remarkably. Its per-position error rate is consistently the lowest, and the correction gain is always the highest. Availability and Implementation The source code is available at bioinformatics.gxu.edu.cn/ngs/mec. Contacts [email protected] or [email protected]. Supplementary information Supplementary data are available at Bioinformatics online.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A survey of error-correction methods for next-generation sequencing

UNLABELLED Error Correction is important for most next-generation sequencing applications because highly accurate sequenced reads will likely lead to higher quality results. Many techniques for error correction of sequencing data from next-gen platforms have been developed in the recent years. However, compared with the fast development of sequencing technologies, there is a lack of standardize...

متن کامل

Karect: accurate correction of substitution, insertion and deletion errors for next-generation sequencing data

MOTIVATION Next-generation sequencing generates large amounts of data affected by errors in the form of substitutions, insertions or deletions of bases. Error correction based on the high-coverage information, typically improves de novo assembly. Most existing tools can correct substitution errors only; some support insertions and deletions, but accuracy in many cases is low. RESULTS We prese...

متن کامل

An Empirical Evaluation of Error Correction Methods and Tools for Next Generation Sequencing Data

Next Generation Sequencing (NGS) technologies produce massive amount of low cost data that is very much useful in genomic study and research. However, data produced by NGS is affected by different errors such as substitutions, deletions or insertion. It is essential to differentiate between true biological variants and alterations occurred due to errors for accurate downstream analysis. Many ty...

متن کامل

Recount: expectation maximization based error correction tool for next generation sequencing data.

Next generation sequencing technologies enable rapid, large-scale production of sequence data sets. Unfortunately these technologies also have a non-neglible sequencing error rate, which biases their outputs by introducing false reads and reducing the quantity of the real reads. Although methods developed for SAGE data can reduce these false counts to a considerable degree, until now they have ...

متن کامل

RACER: Rapid and accurate correction of errors in reads

MOTIVATION High-throughput next-generation sequencing technologies enable increasingly fast and affordable sequencing of genomes and transcriptomes, with a broad range of applications. The quality of the sequencing data is crucial for all applications. A significant portion of the data produced contains errors, and ever more efficient error correction programs are needed. RESULTS We propose R...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Bioinformatics

دوره 33 23  شماره 

صفحات  -

تاریخ انتشار 2017